Automated Annotation Workflow

This workflow uses the auto_annot tools from besca to annotate a new scRNAseq dataset based on one or more preannotated datasets. Ideally, these datasets come from a similar tissue and condition.

We annotate each individual cell with supervised machine learning methods such as support vector machines (SVM) or logistic regression.

First, the training dataset(s) and the testing dataset are loaded from h5ad files or made available as adata objects. Next, the training and testing datasets are corrected using scanorama, and the training datasets are then merged into one anndata object. Then, the classifier is trained on the merged training data. Finally, the classifier is applied to the testing dataset to predict the cell types. If the testing dataset is already annotated (to test the algorithm), a report including confusion matrices can be generated.

In [1]:
import besca as bc
In [2]:
import scanpy as sc
import pkg_resources

test load datasets with scvelo

Apparently the scv loader makes sure the adata objects are all in comparable format whereas the sc loader loads them as is.

In [3]:
adata_test = bc.datasets.Granja2019_processed()
In [4]:
adata_test_orig  = bc.datasets.Granja2019_processed()
In [5]:
adata_train1 = bc.datasets.Kotliarov2020_processed()

Concatenation does not lead to errors when the scv loader is used.

In [6]:
adata_train_list = [adata_train1]

Parameter specification

Give your analysis a name.

In [7]:
analysis_name = 'auto_annot_pubimage_trainKtestG' # The analysis name will be used to name the output files

Specify the column name of the cell type annotation you want to train on.

In [8]:
celltype ='dblabel' # This needs to be a column in the .obs of the training datasets (and test dataset if you want to generate a report)

Choose a method:

  • linear: Support Vector Machine with Linear Kernel
  • sgd: Support Vector Machine with Linear Kernel using Stochastic Gradient Descent
  • rbf: Support Vector Machine with radial basis function kernel. Very time intensive, use only on small datasets.
  • logistic_regression: Standard logistic classifier with multinomial loss.
  • logistic_regression_ovr: Logistic Regression with one versus rest classification.
  • logistic_regression_elastic: Logistic Regression with elastic loss, cross validates among multiple l1 ratios.
In [9]:
method = 'logistic_regression'
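
To make the method names above more concrete, here is a rough sketch of how they could correspond to scikit-learn estimators. The function make_classifier and the chosen hyperparameters are hypothetical and only illustrate the idea; besca's internal implementation may differ. The elastic-net variant is left out to keep the sketch short.

```python
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC, LinearSVC


def make_classifier(method):
    """Hypothetical mapping from a method name to a scikit-learn estimator.

    This is an illustration only, not besca's actual code.
    """
    if method == "linear":
        return LinearSVC()
    if method == "sgd":
        # Hinge loss yields a linear SVM trained by stochastic gradient descent.
        return SGDClassifier(loss="hinge")
    if method == "rbf":
        return SVC(kernel="rbf")
    if method == "logistic_regression":
        return LogisticRegression(max_iter=1000)
    if method == "logistic_regression_ovr":
        # One binary logistic regression per class.
        return OneVsRestClassifier(LogisticRegression(max_iter=1000))
    raise ValueError(f"unknown method: {method}")


toy_estimator = make_classifier("logistic_regression")
print(type(toy_estimator).__name__)
```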

Specify the merge method. It needs to be either scanorama or naive.

In [10]:
merge = 'scanorama' # We recommend to use scanorama here

Decide if you want to use the raw format or highly variable genes. Raw increases computational time and does not necessarily improve predictions.

In [11]:
use_raw = False # We recommend to use False here

You can choose to only consider a subset of genes from a signature set or use all genes.

In [12]:
genes_to_use = 'all' # We suggest to use all here, but the runtime is strongly improved if you select an appropriate gene set

Column names need to be standardised so the function knows which columns to compare.

In [13]:
adata_train_list[0].obs["dblabel"] = adata_train_list[0].obs.celltype3
adata_test.obs["dblabel"] = adata_test.obs.celltype3
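
When several datasets store their annotation under different column names, the same copy step can be wrapped in a small helper. The sketch below works on toy pandas DataFrames that stand in for the .obs tables; the helper name standardise_label and the column cell_label are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the .obs tables of two datasets with different column names.
obs_a = pd.DataFrame({"celltype3": ["naive B cell", "plasma cell"]})
obs_b = pd.DataFrame({"cell_label": ["basophil", "naive B cell"]})


def standardise_label(obs, source_column, target_column="dblabel"):
    """Copy an existing annotation column to the shared label column."""
    obs[target_column] = obs[source_column].astype("category")
    return obs


standardise_label(obs_a, "celltype3")
standardise_label(obs_b, "cell_label")
print(list(obs_a["dblabel"]), list(obs_b["dblabel"]))
```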
In [14]:
adata_test.obs.dblabel.unique()
Out[14]:
[naive thymus-derived CD4-positive, alpha-beta ..., classical monocyte, naive B cell, lymphocyte of B lineage, naive thymus-derived CD8-positive, alpha-beta ..., ..., IL7R-max CD8-positive, alpha-beta cytotoxic T ..., hematopoietic multipotent progenitor cell, myeloid leukocyte, basophil, plasma cell]
Length: 25
Categories (25, object): [naive thymus-derived CD4-positive, alpha-beta ..., classical monocyte, naive B cell, lymphocyte of B lineage, ..., hematopoietic multipotent progenitor cell, myeloid leukocyte, basophil, plasma cell]
In [15]:
adata_train_list[0].obs.dblabel.unique()
Out[15]:
[cytotoxic CD56-dim natural killer cell, naive thymus-derived CD8-positive, alpha-beta ..., naive thymus-derived CD4-positive, alpha-beta ..., classical monocyte, CD8-positive, alpha-beta cytotoxic T cell, ..., regulatory T cell, CD1c-positive myeloid dendritic cell, plasmacytoid dendritic cell, erythrocyte, plasma cell]
Length: 14
Categories (14, object): [cytotoxic CD56-dim natural killer cell, naive thymus-derived CD8-positive, alpha-beta ..., naive thymus-derived CD4-positive, alpha-beta ..., classical monocyte, ..., CD1c-positive myeloid dendritic cell, plasmacytoid dendritic cell, erythrocyte, plasma cell]
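
The two label sets differ in size (25 versus 14), so some test labels can never be predicted by a classifier trained on this training set. A quick set comparison makes the overlap explicit. The sets below are toy examples that borrow a few names from the outputs above; they are not the full label sets. With the real data one would build them from the dblabel columns instead.

```python
# Toy label sets; in practice build them from the .obs columns, e.g.
# set(adata_train_list[0].obs[celltype]) and set(adata_test.obs[celltype]).
train_labels = {"classical monocyte", "plasma cell", "regulatory T cell"}
test_labels = {"classical monocyte", "plasma cell", "basophil"}

shared = train_labels & test_labels       # labels the classifier can get right
only_test = test_labels - train_labels    # labels the classifier cannot predict
only_train = train_labels - test_labels   # labels without a counterpart in the test set

print("shared:", sorted(shared))
print("only in test:", sorted(only_test))
print("only in train:", sorted(only_train))
```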
In [16]:
adata_test.var.dtypes
Out[16]:
ENSEMBL           object
SYMBOL            object
feature_type    category
n_cells            int64
total_counts     float32
frac_reads       float32
dtype: object
In [17]:
adata_train_list[0].var.dtypes
Out[17]:
ENSEMBL         category
SYMBOL            object
feature_type    category
n_cells          float64
total_counts     float32
frac_reads       float32
dtype: object

Correct datasets (e.g. using scanorama) and merge training datasets

This function merges the training datasets, removes unwanted genes, and, if scanorama is used, corrects for batch effects between the datasets.

In [18]:
adata_train, adata_test_corrected = bc.tl.auto_annot.merge_data(adata_train_list, adata_test, genes_to_use = genes_to_use, merge = merge)
merging with scanorama
using scanorama rn
Found 640 genes among all datasets
[[0.         0.62221009]
 [0.         0.        ]]
Processing datasets (0, 1)
integrating training set
calculating intersection
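
The log line about genes found among all datasets reflects that only genes shared by every dataset can be used. As a minimal sketch of that intersection step (not the merge_data implementation itself), the shared genes of several gene indices can be computed as follows; the gene lists are toy examples.

```python
from functools import reduce

import pandas as pd

# Toy var_names of three datasets.
var_names = [
    pd.Index(["CD4", "CD8A", "IL7R", "MS4A1"]),
    pd.Index(["CD8A", "IL7R", "MS4A1", "GNLY"]),
    pd.Index(["IL7R", "MS4A1", "CD8A", "HBB"]),
]

# Successively intersect the indices to keep genes present in every dataset.
shared_genes = reduce(lambda left, right: left.intersection(right), var_names)
print(sorted(shared_genes))  # ['CD8A', 'IL7R', 'MS4A1']
```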

Train the classifier

The returned scaler is fitted on the training dataset (to zero mean and unit variance). The same scaling is then applied to the counts in the testing dataset, and the classifier is applied to the scaled testing dataset (see next step, adata_predict()). This function will run multiple jobs in parallel if logistic regression was specified as method.

In [19]:
classifier, scaler = bc.tl.auto_annot.fit(adata_train, method, celltype, njobs=10)
[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   5 out of   5 | elapsed:  3.3min finished
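
The key point of this step is that the scaler is fitted on the training data only and then reused unchanged on the test data. The following sketch shows that pattern with scikit-learn on synthetic data; the variable names are prefixed with toy_ to avoid clashing with the notebook's classifier and scaler, and the setup is an assumption for illustration rather than besca's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(200, 10))
y_train = (X_train[:, 0] > 5.0).astype(int)
X_test = rng.normal(loc=5.0, scale=2.0, size=(50, 10))

# Fit the scaler on the training data only.
toy_scaler = StandardScaler().fit(X_train)
toy_classifier = LogisticRegression(max_iter=1000).fit(
    toy_scaler.transform(X_train), y_train
)

# Reuse the training-set mean and variance to scale the test data.
toy_pred = toy_classifier.predict(toy_scaler.transform(X_test))
print(toy_pred.shape)
```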

Prediction

If in addition to the most likely class you would like to have all class probabilities returned use the following function. (This is only a sensible choice if using logistic regression.)

In [20]:
adata_predicted = bc.tl.auto_annot.adata_pred_prob(classifier = classifier, scaler = scaler, adata_pred = adata_test_corrected, adata_orig = adata_test_orig, threshold = 0.0)
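
To illustrate what class probabilities alongside the most likely class look like, the sketch below fits a logistic regression on synthetic two-dimensional data and collects one probability column per class plus the winning label. It uses plain scikit-learn and pandas; the data, the class names, and the toy_clf variable are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = np.array(["B cell", "T cell", "monocyte"])
X = np.vstack([rng.normal(center, 1.0, size=(30, 2)) for center in centers])
y = np.repeat(labels, 30)

toy_clf = LogisticRegression(max_iter=1000).fit(X, y)

# One probability column per class, in the order of toy_clf.classes_.
class_columns = list(toy_clf.classes_)
proba = pd.DataFrame(toy_clf.predict_proba(X), columns=class_columns)

# The most likely class is the column with the highest probability.
proba["auto_annot"] = proba[class_columns].idxmax(axis=1)
print(proba.head())
```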

Output

The adata object that includes the predicted cell type annotation can be written out as an h5ad file.

In [21]:
adata_predicted.write('./adata_predicted_trainKtestG.h5ad')
... storing 'auto_annot' as categorical

If the testing dataset already included a cell type annotation, a report can be generated and written, which includes metrics, confusion matrices and comparative umap plots.

In [22]:
adata_predicted.obs
Out[22]:
Group nUMI_pre nUMI nGene initialClusters UMAP1 UMAP2 Clusters BioClassification Barcode ... cytotoxic CD56-dim natural killer cell erythrocyte memory B cell naive B cell naive thymus-derived CD4-positive, alpha-beta T cell naive thymus-derived CD8-positive, alpha-beta T cell non-classical monocyte plasma cell plasmacytoid dendritic cell regulatory T cell
BMMC_10x_GREENLEAF_REP1:AAACCCAAGATGCAGC-1 BMMC_D1T1 3982 2433 1352 Cluster20 6.333280 -1.546073 Cluster22 22_CD4.M AAACCCAAGATGCAGC-1 ... 6.378080e-04 2.121509e-04 2.299687e-03 8.593314e-05 0.311226 3.362999e-02 1.536731e-04 1.073661e-04 1.273278e-04 0.648373
BMMC_10x_GREENLEAF_REP1:AAACCCACAAACTCGT-1 BMMC_D1T1 6530 5106 2001 Cluster17 -6.626036 -5.624946 Cluster11 11_CD14.Mono.1 AAACCCACAAACTCGT-1 ... 4.998815e-05 3.161103e-07 3.818937e-07 6.769557e-07 0.004718 5.517141e-06 2.724748e-05 1.429384e-07 1.026657e-06 0.001603
BMMC_10x_GREENLEAF_REP1:AAACCCACAGTGTACT-1 BMMC_D1T1 4435 3589 1441 Cluster17 -7.221331 -5.927391 Cluster11 11_CD14.Mono.1 AAACCCACAGTGTACT-1 ... 6.386218e-06 8.387993e-09 3.992604e-08 1.539377e-06 0.000947 1.861379e-07 2.224607e-04 4.151118e-09 3.542656e-07 0.000059
BMMC_10x_GREENLEAF_REP1:AAACCCATCGCTATTT-1 BMMC_D1T1 5119 3603 1809 Cluster31 -0.538453 12.052552 Cluster16 16_Pre.B AAACCCATCGCTATTT-1 ... 1.339627e-07 1.561316e-05 9.860716e-01 1.351204e-02 0.000161 1.175169e-05 2.157746e-07 1.388064e-04 4.223628e-06 0.000032
BMMC_10x_GREENLEAF_REP1:AAACGAACACCCAATA-1 BMMC_D1T1 2748 2065 1106 Cluster5 -3.617507 8.777384 Cluster15 15_CLP.2 AAACGAACACCCAATA-1 ... 6.215394e-05 2.385034e-03 2.775454e-02 9.457179e-01 0.010348 5.808132e-03 2.419661e-04 1.516686e-03 1.133244e-03 0.003265
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
CD34_32_R5:TTTGCGCTCAGGCAAG-1 CD34_D2T1 7519 3425 1839 Cluster3 -7.776848 8.070736 Cluster1 01_HSC TTTGCGCTCAGGCAAG-1 ... 2.476810e-03 7.059291e-01 9.883694e-03 1.581899e-02 0.238185 4.877589e-03 4.005983e-04 3.981916e-04 7.755575e-04 0.010074
CD34_32_R5:TTTGGTTCAATCCGAT-1 CD34_D2T1 2658 1933 1243 Cluster6 -13.658864 3.211881 Cluster2 02_Early.Eryth TTTGGTTCAATCCGAT-1 ... 4.835098e-04 9.353054e-02 3.345350e-04 8.222923e-03 0.812504 5.705371e-02 1.039745e-04 3.024653e-04 4.919430e-04 0.014081
CD34_32_R5:TTTGTCAGTAGAAGGA-1 CD34_D2T1 11973 5599 2531 Cluster1 -7.483875 5.000548 Cluster5 05_CMP.LMPP TTTGTCAGTAGAAGGA-1 ... 1.095417e-03 6.871403e-01 1.934323e-03 1.742111e-05 0.050563 2.354034e-01 2.559435e-05 2.744491e-03 2.798613e-04 0.001364
CD34_32_R5:TTTGTCATCCACGCAG-1 CD34_D2T1 15348 7081 2983 Cluster1 -7.554902 3.831676 Cluster5 05_CMP.LMPP TTTGTCATCCACGCAG-1 ... 5.822846e-04 8.763712e-01 3.315870e-03 1.121812e-05 0.035575 7.642585e-02 9.679483e-06 8.551016e-04 5.005859e-05 0.000165
CD34_32_R5:TTTGTCATCGCTTGTC-1 CD34_D2T1 9995 4918 2319 Cluster2 -8.528916 0.883755 Cluster8 08_GMP.Neut TTTGTCATCGCTTGTC-1 ... 3.478134e-03 3.039101e-01 1.486180e-03 6.436302e-04 0.504826 1.139371e-01 1.753510e-04 2.301701e-03 5.918303e-04 0.005041

34813 rows × 36 columns

In [23]:
adata_predicted = bc.st.clustering(adata_predicted, '.')
leiden clustering performed with a resolution of 1
WARNING: saving figure to file figures/umap.leiden.png
rank genes per cluster calculated using method wilcoxon.
mapping of cells to  leiden exported successfully to cell2labels.tsv
average.gct exported successfully to file
fract_pos.gct exported successfully to file
labelinfo.tsv successfully written out
./labelings/leiden/WilxRank.gct written out
./labelings/leiden/WilxRank.pvalues.gct written out
./labelings/leiden/WilxRank.logFC.gct written out
In [24]:
%matplotlib inline
sc.settings.set_figure_params(dpi=90)
bc.tl.report(adata_pred=adata_predicted, celltype=celltype, method=method, analysis_name=analysis_name,
                        train_datasets=adata_train_list, test_dataset=adata_test_orig, merge=merge, use_raw=False,
                        genes_to_use=genes_to_use, remove_nonshared=True, clustering='leiden', asymmetric_matrix=True)
WARNING: saving figure to file figures/umap.ondata_auto_annot_pubimage_trainKtestG.png
WARNING: saving figure to file figures/umap.auto_annot_pubimage_trainKtestG.png
Confusion matrix, without normalization
Normalized confusion matrix
In [25]:
sc.settings.set_figure_params(dpi=240)

sc.pl.umap(adata_predicted, color=[celltype, 'auto_annot', 'leiden'], legend_loc='on data',legend_fontsize=7,  save= '.fig4_supp_trainKtestGondata.svg')
sc.pl.umap(adata_predicted, color=[celltype, 'auto_annot', 'leiden'],legend_fontsize=7, wspace = 1.4, save = '.fig4_supp_trainKtestG.svg')
WARNING: saving figure to file figures/umap.fig4_supp_trainKtestGondata.svg
WARNING: saving figure to file figures/umap.fig4_supp_trainKtestG.svg
In [26]:
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(y_true, y_pred, classes, celltype,
                          normalize=False,
                          title=None, numbers =False,
                          cmap=plt.cm.Blues, adata_predicted= None, asymmetric_matrix = True): 

    matplotlib.use('Agg')
    
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    #classes = classes[unique_labels(y_true, y_pred)]
    if asymmetric_matrix == True:
        class_names =  np.unique(np.concatenate((adata_predicted.obs[celltype], adata_predicted.obs['auto_annot'])))
        class_names_orig = np.unique(adata_predicted.obs[celltype])
        class_names_pred = np.unique(adata_predicted.obs['auto_annot'])
        test_celltypes_ind = np.searchsorted(class_names, class_names_orig)
        train_celltypes_ind = np.searchsorted(class_names, class_names_pred)
        cm=cm[test_celltypes_ind,:][:,train_celltypes_ind]
    
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    fig, ax = plt.subplots(figsize=(15,15))
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax, shrink = 0.8)
    # We want to show all ticks...
    if asymmetric_matrix == True:
        ax.set(xticks=np.arange(cm.shape[1]),
               yticks=np.arange(cm.shape[0]),
               # ... and label them with the respective list entries
               xticklabels=class_names_pred, yticklabels=class_names_orig,
               title=title,
               ylabel='True label',
               xlabel='Predicted label')
    else:
        ax.set(xticks=np.arange(cm.shape[1]),
               yticks=np.arange(cm.shape[0]),
               # ... and label them with the respective list entries
               xticklabels=classes, yticklabels=classes,
               title=title,
               ylabel='True label',
               xlabel='Predicted label')
        
    ax.grid(False)
    #ax.tick_params(axis='both', which='major', labelsize=10)
    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    if numbers == True:
        fmt = '.2f' if normalize else 'd'
        thresh = cm.max() / 2.
        for i in range(cm.shape[0]):
            for j in range(cm.shape[1]):
                ax.text(j, i, format(cm[i, j], fmt),
                        ha="center", va="center",
                        color="white" if cm[i, j] > thresh else "black")
    #fig.tight_layout()
    return ax
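
The asymmetric sub-matrix and the row normalisation used in the function above can be checked on a tiny example. The sketch below builds the confusion matrix over the union of labels, keeps only the true labels as rows and the predicted labels as columns, and normalises each row. Unlike the function above, it guards against division by zero for empty rows; that guard is an addition for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = ["A", "A", "B", "C", "C", "C"]
y_pred = ["A", "B", "B", "A", "A", "B"]  # class "C" is never predicted

# Confusion matrix over the sorted union of all labels.
labels = np.unique(np.concatenate((y_true, y_pred)))
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Rows: labels that occur as true labels; columns: labels that occur as predictions.
rows = np.searchsorted(labels, np.unique(y_true))
cols = np.searchsorted(labels, np.unique(y_pred))
cm_asym = cm[rows, :][:, cols]

# Row-normalise, leaving rows without any cells at zero.
row_sums = cm_asym.sum(axis=1, keepdims=True)
cm_norm = np.divide(
    cm_asym, row_sums, out=np.zeros(cm_asym.shape, dtype=float), where=row_sums > 0
)
print(cm_asym.shape)
print(cm_norm)
```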
In [27]:
import os
In [28]:
# make conf matrices (4)
class_names =  np.unique(np.concatenate((adata_predicted.obs[celltype], adata_predicted.obs['auto_annot'])))
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plot_confusion_matrix(adata_predicted.obs[celltype], adata_predicted.obs['auto_annot'], title = " ", classes=class_names, celltype=celltype ,numbers = False, adata_predicted = adata_predicted, asymmetric_matrix = True)
plt.savefig(os.path.join('fig4_supp_trainKtestG_confusion_matrix_nonnormalised.svg'))

# Plot normalized confusion matrix (cell values are not printed because numbers = False)
plot_confusion_matrix(adata_predicted.obs[celltype], adata_predicted.obs['auto_annot'], title = " ", classes=class_names,celltype=celltype,  normalize=True, numbers = False, adata_predicted = adata_predicted, asymmetric_matrix = True)
plt.savefig(os.path.join('fig4_supp_trainKtestG_confusion_matrix_normalised.svg'))
Confusion matrix, without normalization
Normalized confusion matrix

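Let's apply a probability threshold. Cells that the classifier cannot assign confidently at this threshold receive the label 'unknown' (they are filtered out further below).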

In [30]:
analysis_name = 'auto_annot_pubimage_threshold_trainKtestG' # The analysis name will be used to name the output files
In [31]:
adata_predicted_threshold = bc.tl.auto_annot.adata_pred_prob(classifier = classifier, scaler = scaler, adata_pred = adata_test_corrected, adata_orig = adata_test_orig, threshold = 0.7)
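
The effect of a threshold such as 0.7 can be sketched on a small probability matrix: a cell keeps its most likely class only if that class reaches the threshold, otherwise it is labelled 'unknown'. The helper label_with_threshold is hypothetical, and whether besca compares with a strict or non-strict inequality is an assumption here.

```python
import numpy as np

classes = np.array(["B cell", "T cell", "monocyte"])
toy_proba = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.75, 0.15],
])


def label_with_threshold(proba, classes, threshold):
    """Return the most likely class per row, or 'unknown' below the threshold."""
    best = proba.argmax(axis=1)
    labels = classes[best].astype(object)
    labels[proba.max(axis=1) < threshold] = "unknown"
    return labels


print(label_with_threshold(toy_proba, classes, 0.7))
# ['B cell' 'unknown' 'T cell']
```

With a threshold of 0.0, as in the first prediction above, no cell falls below the threshold and every cell keeps its most likely class.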
In [32]:
adata_predicted_threshold.write('./adata_predicted_threshold_trainKtestG.h5ad')
... storing 'auto_annot' as categorical
In [33]:
%matplotlib inline
sc.settings.set_figure_params(dpi=90)
bc.tl.report(adata_pred=adata_predicted_threshold, celltype=celltype, method=method, analysis_name=analysis_name,
                        train_datasets=adata_train_list, test_dataset=adata_test_orig, merge=merge, use_raw=False,
                        genes_to_use=genes_to_use, remove_nonshared=True, clustering='leiden', asymmetric_matrix=True)
WARNING: saving figure to file figures/umap.ondata_auto_annot_pubimage_threshold_trainKtestG.png
WARNING: saving figure to file figures/umap.auto_annot_pubimage_threshold_trainKtestG.png
Confusion matrix, without normalization
Normalized confusion matrix
In [34]:
sc.settings.set_figure_params(dpi=240)

sc.pl.umap(adata_predicted_threshold, color=[celltype, 'auto_annot', 'leiden'], legend_loc='on data',legend_fontsize=7,  save= '.fig4_supp_trainKtestG_threshold_ondata.svg')
sc.pl.umap(adata_predicted_threshold, color=[celltype, 'auto_annot', 'leiden'],legend_fontsize=7, wspace = 1.4, save = '.fig4_supp_trainKtestG_threshold.svg')
WARNING: saving figure to file figures/umap.fig4_supp_trainKtestG_threshold_ondata.svg
WARNING: saving figure to file figures/umap.fig4_supp_trainKtestG_threshold.svg
In [35]:
# make conf matrices (4)
class_names =  np.unique(np.concatenate((adata_predicted_threshold.obs[celltype], adata_predicted_threshold.obs['auto_annot'])))
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plot_confusion_matrix(adata_predicted_threshold.obs[celltype], adata_predicted_threshold.obs['auto_annot'], title = " ", classes=class_names, celltype=celltype ,numbers = False, adata_predicted = adata_predicted_threshold, asymmetric_matrix = True)
plt.savefig(os.path.join('fig4_trainKtestG_confusion_matrix_threshold_nonnormalised.svg'))

# Plot normalized confusion matrix (cell values are not printed because numbers = False)
plot_confusion_matrix(adata_predicted_threshold.obs[celltype], adata_predicted_threshold.obs['auto_annot'], title = " ", classes=class_names,celltype=celltype,  normalize=True, numbers = False, adata_predicted = adata_predicted_threshold, asymmetric_matrix = True)
plt.savefig(os.path.join('fig4_trainKtestG_confusion_matrix_threshold_normalised.svg'))
Confusion matrix, without normalization
Normalized confusion matrix
In [37]:
adata_predicted_wo_unknown = adata_predicted_threshold.copy()
adata_predicted_wo_unknown = bc.subset_adata(adata_predicted_wo_unknown, adata_predicted_wo_unknown.obs.auto_annot != 'unknown', raw=False)
bc.pl.riverplot_2categories(adata_predicted_wo_unknown, [celltype, 'auto_annot'])

Let's check whether the differences in annotation make sense by scoring immune gene signatures.

In [38]:
gmt_file_IMM=pkg_resources.resource_filename('besca', 'datasets/genesets/HumanCD45p_scseqCMs6.gmt')
bc.tl.sig.combined_signature_score(adata_predicted, gmt_file_IMM)
WARNING: genes are not in var_names and ignored: ['FCRL4']
WARNING: genes are not in var_names and ignored: ['FCGR3']
WARNING: genes are not in var_names and ignored: ['CCR3']
WARNING: genes are not in var_names and ignored: ['CASP8AP2', 'DSSC1']
WARNING: genes are not in var_names and ignored: ['CSK2']
WARNING: genes are not in var_names and ignored: ['FAP', 'DCN', 'COL1A2', 'CXCL14', 'LUM', 'COL3A1', 'DPT', 'ISLR', 'PODN', 'FDF7', 'PDGFRL']
WARNING: genes are not in var_names and ignored: ['TNFA', 'IL4', 'IL7A', 'IL8', 'IL12', 'IL13', 'IL21', 'IL22', 'IL23', 'CXCL5', 'CXCL9', 'CXCL11', 'CXCL12', 'CXCL13', 'CX3CL1', 'GM-CSF', 'GCSFCCL1', 'CCL7', 'CCL11', 'CCL12', 'CCL13', 'CCL17', 'CCL19', 'CCL22', 'CCL24', 'CCL26', 'CCL27', 'SDF1A', 'BCA1', 'MIP1B']
WARNING: genes are not in var_names and ignored: ['LY6C1', 'SIGLECH']
WARNING: genes are not in var_names and ignored: ['CDH5', 'ITCAM1', 'ITGB3', 'KDR', 'PECAM1', 'SELE']
WARNING: genes are not in var_names and ignored: ['CDH5', 'ITGB3', 'KDR', 'PECAM1', 'SELE']
WARNING: genes are not in var_names and ignored: ['PECAM1', 'CDH5', 'ECSCR', 'CCL14', 'SLCO2A1', 'KDR', 'FABP4', 'SDPR']
WARNING: genes are not in var_names and ignored: ['CCR3', 'IL9R', 'SLIGLEC10', 'SIGLEC8']
WARNING: genes are not in var_names and ignored: ['KRT19']
WARNING: genes are not in var_names and ignored: ['TILPL2']
WARNING: genes are not in var_names and ignored: ['HLA-H', 'HLA-L', 'HLA-DRB2']
WARNING: genes are not in var_names and ignored: ['OAS1G']
WARNING: genes are not in var_names and ignored: ['CXCL9', 'IDO1']
WARNING: genes are not in var_names and ignored: ['ADGRE1']
WARNING: genes are not in var_names and ignored: ['ITGB3', 'PECAM1']
WARNING: genes are not in var_names and ignored: ['MIA', 'TYR', 'SLC45A2', 'CDH19', 'SLC24A5', 'MAGEA6', 'GJB1', 'PLP1', 'PRAME', 'PAX3', 'MLANA']
WARNING: genes are not in var_names and ignored: ['SLIT2', 'BGN', 'TNC', 'CYR6', 'GFRA3', 'SLITRK6']
WARNING: genes are not in var_names and ignored: ['IGHG1', 'IGHG2', 'IGHA1']
WARNING: genes are not in var_names and ignored: ['FCGR3']
WARNING: genes are not in var_names and ignored: ['FCGR3']
WARNING: genes are not in var_names and ignored: ['FCGR3', 'FCGR1']
WARNING: genes are not in var_names and ignored: ['FCGR4', 'FCGR1']
WARNING: genes are not in var_names and ignored: ['CD80', 'LY6G', 'CD177']
WARNING: genes are not in var_names and ignored: ['TRDC']
WARNING: genes are not in var_names and ignored: ['IGHD', 'IGHM']
WARNING: genes are not in var_names and ignored: ['CEACAM8']
WARNING: genes are not in var_names and ignored: ['ITGA8', 'IGJ']
WARNING: genes are not in var_names and ignored: ['SOX9']
WARNING: genes are not in var_names and ignored: ['SOX9']
WARNING: genes are not in var_names and ignored: ['MMP1', 'PDGFRA', 'PECAM1']
WARNING: genes are not in var_names and ignored: ['TRADO']
WARNING: genes are not in var_names and ignored: ['CXCL12']
WARNING: genes are not in var_names and ignored: ['CXCL11', 'CXCL9']
WARNING: genes are not in var_names and ignored: ['ANGTPL4', 'CXCL5']
WARNING: genes are not in var_names and ignored: ['CSF2', 'SPP4', 'IFNA1', 'TNFSF11']
WARNING: genes are not in var_names and ignored: ['IL17A', 'IL21', 'IL22']
WARNING: genes are not in var_names and ignored: ['CCR3', 'CCR8', 'CSF2', 'CSCR4', 'HAVCR1', 'IL13', 'IL4', 'IL5']
WARNING: genes are not in var_names and ignored: ['TRAC']
WARNING: genes are not in var_names and ignored: ['TRGC1', 'TRDC', 'TRDV2', 'TRDV1']
WARNING: provided gene list has length 0, scores as 0
WARNING: genes are not in var_names and ignored: ['TRGC2']
WARNING: genes are not in var_names and ignored: ['CXCL9']
WARNING: genes are not in var_names and ignored: ['FLAMF1']
WARNING: genes are not in var_names and ignored: ['LRRC32']
WARNING: genes are not in var_names and ignored: ['CCXR3']
WARNING: genes are not in var_names and ignored: ['CCL22', 'CCL17', 'CCL19']
WARNING: genes are not in var_names and ignored: ['XCR1']
WARNING: genes are not in var_names and ignored: ['PLET1', 'XCR1']
WARNING: genes are not in var_names and ignored: ['SIGLECG', 'PLET1', 'PPP1R1A']
WARNING: genes are not in var_names and ignored: ['SIGLECH']
WARNING: genes are not in var_names and ignored: ['C7', 'SIGLECG']
In [39]:
adata_predicted.var_names
Out[39]:
Index(['HES4', 'ISG15', 'TNFRSF18', 'TNFRSF4', 'MXRA8', 'MMP23B', 'PLCH2',
       'MEGF6', 'GPR153', 'TNFRSF25',
       ...
       'AIRE', 'ITGB2-AS1', 'ADARB1', 'COL18A1', 'COL6A1', 'COL6A2', 'FTCD',
       'MCM3AP-AS1', 'DIP2A', 'S100B'],
      dtype='object', length=1725)
In [40]:
scores = [x for x in adata_predicted.obs.columns if 'CD45' in x]
scores
Out[40]:
['score_HumanCD45p_scseqCMs6_ActB_scanpy',
 'score_HumanCD45p_scseqCMs6_Activation_scanpy',
 'score_HumanCD45p_scseqCMs6_Basophil_scanpy',
 'score_HumanCD45p_scseqCMs6_Bcells_scanpy',
 'score_HumanCD45p_scseqCMs6_CCG1S_scanpy',
 'score_HumanCD45p_scseqCMs6_CCG2M_scanpy',
 'score_HumanCD45p_scseqCMs6_Cafs_scanpy',
 'score_HumanCD45p_scseqCMs6_Cellcycle_scanpy',
 'score_HumanCD45p_scseqCMs6_Checkpoint_scanpy',
 'score_HumanCD45p_scseqCMs6_Cyto_scanpy',
 'score_HumanCD45p_scseqCMs6_Cytotox_scanpy',
 'score_HumanCD45p_scseqCMs6_DCR_scanpy',
 'score_HumanCD45p_scseqCMs6_DCrec_scanpy',
 'score_HumanCD45p_scseqCMs6_DCs_scanpy',
 'score_HumanCD45p_scseqCMs6_Eff_scanpy',
 'score_HumanCD45p_scseqCMs6_Endo_scanpy',
 'score_HumanCD45p_scseqCMs6_Endot_scanpy',
 'score_HumanCD45p_scseqCMs6_Endothelial_scanpy',
 'score_HumanCD45p_scseqCMs6_Eosinophil_scanpy',
 'score_HumanCD45p_scseqCMs6_Epith_scanpy',
 'score_HumanCD45p_scseqCMs6_ExhB_scanpy',
 'score_HumanCD45p_scseqCMs6_Granulo_scanpy',
 'score_HumanCD45p_scseqCMs6_HLA_scanpy',
 'score_HumanCD45p_scseqCMs6_HLAP_scanpy',
 'score_HumanCD45p_scseqCMs6_HLAS_scanpy',
 'score_HumanCD45p_scseqCMs6_Ifi_scanpy',
 'score_HumanCD45p_scseqCMs6_Ifng_scanpy',
 'score_HumanCD45p_scseqCMs6_Macrophage_scanpy',
 'score_HumanCD45p_scseqCMs6_Mast_scanpy',
 'score_HumanCD45p_scseqCMs6_Megakaryocytes_scanpy',
 'score_HumanCD45p_scseqCMs6_MelMelan_scanpy',
 'score_HumanCD45p_scseqCMs6_MelMesen_scanpy',
 'score_HumanCD45p_scseqCMs6_MemB_scanpy',
 'score_HumanCD45p_scseqCMs6_Memory_scanpy',
 'score_HumanCD45p_scseqCMs6_Mo14_scanpy',
 'score_HumanCD45p_scseqCMs6_Mo16_scanpy',
 'score_HumanCD45p_scseqCMs6_MoMa_scanpy',
 'score_HumanCD45p_scseqCMs6_Monocytes_scanpy',
 'score_HumanCD45p_scseqCMs6_Myelo_scanpy',
 'score_HumanCD45p_scseqCMs6_MyeloSubtype_scanpy',
 'score_HumanCD45p_scseqCMs6_NKT_scanpy',
 'score_HumanCD45p_scseqCMs6_NKcells_scanpy',
 'score_HumanCD45p_scseqCMs6_NKcyt_scanpy',
 'score_HumanCD45p_scseqCMs6_NKnai_scanpy',
 'score_HumanCD45p_scseqCMs6_Naive_scanpy',
 'score_HumanCD45p_scseqCMs6_NaiveB_scanpy',
 'score_HumanCD45p_scseqCMs6_Neutrophil_scanpy',
 'score_HumanCD45p_scseqCMs6_NonEff_scanpy',
 'score_HumanCD45p_scseqCMs6_OMyelo_scanpy',
 'score_HumanCD45p_scseqCMs6_Others_scanpy',
 'score_HumanCD45p_scseqCMs6_Plasma_scanpy',
 'score_HumanCD45p_scseqCMs6_Pyro_scanpy',
 'score_HumanCD45p_scseqCMs6_Stemmess_scanpy',
 'score_HumanCD45p_scseqCMs6_StemmessS_scanpy',
 'score_HumanCD45p_scseqCMs6_Stromal_scanpy',
 'score_HumanCD45p_scseqCMs6_T4CM_scanpy',
 'score_HumanCD45p_scseqCMs6_TAM_scanpy',
 'score_HumanCD45p_scseqCMs6_TAMCx_scanpy',
 'score_HumanCD45p_scseqCMs6_TEM_scanpy',
 'score_HumanCD45p_scseqCMs6_TMO_scanpy',
 'score_HumanCD45p_scseqCMs6_TMid_scanpy',
 'score_HumanCD45p_scseqCMs6_TNK_scanpy',
 'score_HumanCD45p_scseqCMs6_TStem_scanpy',
 'score_HumanCD45p_scseqCMs6_TStemhi_scanpy',
 'score_HumanCD45p_scseqCMs6_TSteml_scanpy',
 'score_HumanCD45p_scseqCMs6_TStemlo_scanpy',
 'score_HumanCD45p_scseqCMs6_TTh1_scanpy',
 'score_HumanCD45p_scseqCMs6_TTh17_scanpy',
 'score_HumanCD45p_scseqCMs6_TTh2_scanpy',
 'score_HumanCD45p_scseqCMs6_Tcd4_scanpy',
 'score_HumanCD45p_scseqCMs6_Tcd8_scanpy',
 'score_HumanCD45p_scseqCMs6_Tcells_scanpy',
 'score_HumanCD45p_scseqCMs6_Tcgd_scanpy',
 'score_HumanCD45p_scseqCMs6_Tcytox_scanpy',
 'score_HumanCD45p_scseqCMs6_Teff_scanpy',
 'score_HumanCD45p_scseqCMs6_Tfh_scanpy',
 'score_HumanCD45p_scseqCMs6_TilCM_scanpy',
 'score_HumanCD45p_scseqCMs6_Tpexh_scanpy',
 'score_HumanCD45p_scseqCMs6_Treg_scanpy',
 'score_HumanCD45p_scseqCMs6_Ttexh_scanpy',
 'score_HumanCD45p_scseqCMs6_Ubi_scanpy',
 'score_HumanCD45p_scseqCMs6_UnivExh_scanpy',
 'score_HumanCD45p_scseqCMs6_UnivMem_scanpy',
 'score_HumanCD45p_scseqCMs6_UnivNaive_scanpy',
 'score_HumanCD45p_scseqCMs6_aDCs_scanpy',
 'score_HumanCD45p_scseqCMs6_allSteml_scanpy',
 'score_HumanCD45p_scseqCMs6_cDC1_scanpy',
 'score_HumanCD45p_scseqCMs6_cDC2_scanpy',
 'score_HumanCD45p_scseqCMs6_cDCs_scanpy',
 'score_HumanCD45p_scseqCMs6_epDCs_scanpy',
 'score_HumanCD45p_scseqCMs6_general_scanpy',
 'score_HumanCD45p_scseqCMs6_moDC_scanpy',
 'score_HumanCD45p_scseqCMs6_pDCs_scanpy',
 'score_HumanCD45p_scseqCMs6_uDCs_scanpy']

Indeed, the classification of B cells appears to be an improvement, whereas the T cell subtypes pose difficulties.
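
One way to back up such a visual impression with numbers is to average the signature scores per predicted label. The sketch below does this with pandas on a toy table; the column names score_NaiveB and score_MemB and all values are made up, and with the real data one would use adata_predicted.obs and the score columns listed above.

```python
import pandas as pd

# Toy stand-in for adata_predicted.obs with two signature score columns.
toy_obs = pd.DataFrame({
    "auto_annot": ["naive B cell", "naive B cell", "memory B cell", "memory B cell"],
    "score_NaiveB": [0.8, 0.6, 0.1, 0.3],
    "score_MemB": [0.1, 0.2, 0.7, 0.9],
})

score_cols = [col for col in toy_obs.columns if col.startswith("score_")]

# Mean signature score per predicted cell type.
mean_scores = toy_obs.groupby("auto_annot")[score_cols].mean()
print(mean_scores)
```

If the predictions are sensible, each predicted label should show its highest mean score for the matching signature.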

In [41]:
sc.pl.umap(adata_predicted, color= ["score_HumanCD45p_scseqCMs6_MemB_scanpy", "score_HumanCD45p_scseqCMs6_NaiveB_scanpy","CD4", "CD8A"], ncols = 2, wspace = 0.4, color_map = 'viridis',save= '.fig4_markers.svg')
WARNING: saving figure to file figures/umap.fig4_markers.svg
In [42]:
sc.pl.umap(adata_predicted, color= ["IL7R"], ncols = 2, wspace = 0.4, color_map = 'viridis')